A stitch in time: Efficient computation of genomic DNA melting bubbles 
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Background: It is of biological interest to make genome-wide predictions of the locations of DNA 
melting bubbles using statistical mechanics models. Computationally, this poses the challenge that 
a generic search through all combinations of bubble starts and ends is quadratic. 

Results: An efficient algorithm is described, which shows that the time complexity of the task is 
O(NlogN) rather than quadratic. The algorithm exploits that bubble lengths may be limited, but 
without a prior assumption of a maximal bubble length. No approximations, such as windowing, 
have been introduced to reduce the time complexity. More than just finding the bubbles, the 
algorithm produces a stitch profile, which is a probabilistic graphical model of bubbles and helical 
regions. The algorithm applies a probability peak finding method based on a hierarchical analysis 
of the energy barriers in the Poland-Scheraga model. 

Conclusions: Exact and fast computation of genomic stitch profiles is thus feasible. Sequences of 
several megabases have been computed, only limited by computer memory. Possible applications 
are the genome-wide comparisons of bubbles with promotors, TSS, viral integration sites, and other 
melting-related regions. 
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I. BACKGROUND 



Models of DNA melting make it possible to compute 
what regions that are single-stranded (ss) and what re- 
gions that are double-stranded (ds). Based on statistical 
mechanics, such model predictions are probabilistic by 
nature. Bubbles or single-stranded regions play an essen- 
tial role in fundamental biological processes, such as tran- 
scription, replication, viral integration, repair, recombi- 
nation, and in determining chromatin structure [TJ [2]. 
It is therefore interesting to apply DNA melting models 
to genomic DNA sequences, although the available mod- 
els so far are limited to in vitro knowledge. Genomic 
applications began around 1980 [31 0], and have been 
gaining momentum over the years with the increasing 
availability of sequences, faster computers, and model 
development. It has been found that predicted ds/ss 
boundaries often are located at or very close to exon- 
intron junctions, the correspondence being stronger in 
some genomes than others [5J El E], which suggested 
a gene finding method [9|. In the same vein, compar- 
isons of actin cDNA melting maps in animals, plants, and 
fungi suggested that intron insertion could have target 
the sites of such melting fork junctions in ancient genes 
[TOl ITT] . In other studies, bubbles in promotor regions 
were computed to test the hypothesis that the stability 
of the double helix contributes to transcriptional regula- 
tion [TJ H3 El H3 UJl EZ] . Bubbles induced by superhe- 
licity have also been found to correlate with replication 
origins as well as promotors (T5J [Tj|J [2U1 HI]- In addi- 
tion to the testing of specific hypotheses, a strategy has 
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been to provide whole genomes with annotations of their 
melting properties [22J [23J . Combined with all other ex- 
isting annotations, such melting data allow exploratory 
data mining and possibly to form new hypotheses [24] . 
For example, the human genomic melting map was made 
available, compared to a wide range of other annotations, 
and was shown to provide more information than the lo- 
cal GC content [23] . 

In the genomic studies, various melting features have 
proved to be of particular interest. These include the 
bubbles and helical regions, bubble nucleation sites, 
cooperative melting domains, melting fork junctions, 
breathers, sites of high or low stability, and SIDD sites. 
Most often we want to know their locations, but addi- 
tional information is sometimes useful, such as probabil- 
ities, dynamics, stabilities, and context. DNA melting 
models based on statistical mechanics are powerful tools 
for calculating such properties, especially those models 
that can be solved by dynamical programming in poly- 
nomial time. For many features of interest, however, al- 
gorithms remain to be developed to do such predictions. 
The existing melting algorithms typically produce melt- 
ing profiles of some numerical quantity for each sequence 
position. The prototypical example is Poland's probabil- 
ity profile [25], but also profiles of melting temperatures 
(melting maps), free energies or other quantities are com- 
puted per basepair. The result can be plotted as a curve, 
while the wanted features often have the format of re- 
gions, junctions and other sites. Some genomics data 
mining tools also require data in these formats rather 
than curves. As a remedy, melting profiles have been sub- 
jected to ad hoc post-processing methods to extract the 
wanted features, such as segmentation algorithms [25] , 
thresholding [22 , and relying on the eye through visual- 
ization PITT]. 

In previous work, we developed an algorithm that iden- 
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tifies regions of four types: helical regions, bubbles (in- 
ternal loops), and unzipped 5' and 3' end regions (tails) 
[2"6"l |2"T1 |2"8] . The algorithm produces a siifc/i profile, 
which is a probabilistic graphical model of DNA's confor- 
mational space. A stitch profile contains a set of regions 
of the four types. Each region is called a stitch, because 
of the way they can be connected in paths. The stitch 
profile algorithm computes the location (start and end) 
of each stitch and the probability of that region being in 
the corresponding state (ds or ss) at the specified tem- 
perature. A stitch profile can be plotted in a stitch profile 
diagram, as illustrated in Fig. [I] The location of a bubble 
or helix stitch is not given as a precise coordinate pair 
(x, y), but rather as a pair of ds/ss boundaries with fuzzy 
locations. For each ds/ss boundary, the range of thermal 
fluctuations is computed and given as an interval. A 
stitch profile indicates a number of alternative configu- 
rations, both optimal and suboptimal, as illustrated in 
Fig. [T] In contrast, a melting map would indicate the 
single configuration at each temperature, in which each 
basepair is in its most probable state. 

A stitch profile thus provides some features, e.g. bub- 
bles, that would be of interest in genomic analyses. How- 
ever, the previously described algorithm for computing 
stitch profiles [55] has time complexity 0(N 2 ). Genomics 
studies often require faster algorithms, both to com- 
pute long sequences and to compute many sequences. 
In this paper, therefore, an efficient stitch profile algo- 
rithm with time complexity 0(N log N) is described, and 
the prospects of computing genomic stitch profiles are 
discussed. The original algorithm [26] is referred to as 
Algorithm 1, while the new algorithm is referred to as 
Algorithm 2. 

The reduction in time complexity has been achieved 
without introducing any approximation or simplification 
such as windowing. The usual tradeoff between speed 
and precision is therefore not involved here. The output 
of Algorithm 2 is not of a lower quality, but identical 
to Algorithm l's output. Algorithm 1 was simply ineffi- 
cient. However, it was not obvious that this problem has 
time complexity O(NlogN), which is the same as com- 
puting melting profiles with the Poland-Fixman-Freire 
algorithm [55]. It would appear that the stitch profile 
had greater complexity, for example, that the search for 
all bubble starts and ends would be quadratic. On the 
other hand, we know that bubbles may be small com- 
pared to the sequence length. Algorithm 2 detects such 
circumstances in an adaptive way, without assuming a 
maximal bubble length. 



II. METHODS 

The proper way of computing DNA conformations, as 
well as other macromolecular structures, is to consider a 
rugged landscape |30[ 131], As an abstract mathematical 
function, a landscape applies to widely different complex 
systems, for example, fitness landscapes in evolutionary 



biology for defining populations and species. The rugged- 
ness implies many local maxima and minima on many 
levels. In optimization, the task would be to avoid all 
the "false" local optima and find the global optimum. 
That is not what we want. On the contrary, we would 
prefer to include most of them. 

A local optimum corresponds to an instantaneous con- 
formation or microstate that is more fit or stable than 
its immediate neighbors. However, fluctuations over time 
cover a larger area in the landscape around the local opti- 
mum, which is defined as a macrostate. A macrostate can 
not simply be associated with a local optimum, because 
it usually covers many local optima. On the other hand, 
a local optimum may be part of different macrostates. 
Fluctuations are biologically important, as they repre- 
sent stability and robustness, rather than noise and un- 
certainty |32j . Conformations are properly represented 
by macrostates, not microstates. We want to character- 
ize the whole landscape of DNA conformations by a set 
of macrostates. 

More specifically, this article considers certain prob- 
ability landscapes, in which the probability peaks are 
the macrostates. The algorithmic task is to find a set 
of peaks. Automatic peak detecting is applied in vari- 
ous kinds of spectroscopy (NMR), spectrometry (mass- 
spec), and image segmentation (e.g. in astronomy), but 
these algorithms usually do not consider any hierarchi- 
cal aspects. Hierarchical peak finding is analogous to 
hierarchical clustering, which is widely used in bioinfor- 
matics. However, our approach is closely related to the 
hierarchical analyses of energy landscapes and their bar- 
riers in studies of dynamics, metastability, and timescales 
[3"51 IM1 1331 [33] , The algorithm uses a subroutine for 
finding hierarchical probability peaks in one dimension, 
described in the next section. 



A. ID peaks 

This section briefly revisits the ID peak finding method 
and the use of a nonstandard pedigree terminology |26] . 
Here is a generic formulation of the problem: Let p{x) 
be some probabilities (possibly marginal) defined for 
x = 1, . . . , N. What are the peaks in p(x)l The com- 
putational task is divided into two steps. The first step 
is to construct a discrete tree of possible peaks, and the 
second step is to select peaks by searching the tree. 

To simplify the presentation, we assume that p(x±) =/= 
p(x2) if x\ ^ X2- Let \& be the set of x- values, where 
p{x) has local minima and maxima. We associate a 
possible peak with each element a E "J. If a is a 
local minimum, the peak is defined as illustrated in 
Fig. [2] The peak location is the extent on the x-axis, 
L(a) = [xstart(a), ^end(a)], defined as the largest inter- 
val including a in which p{x) > p(a). The peak width 
is the size of L(a), p w (a) = x cnd (a) - x stait (a) + 1. 
The peak volume is the probability summed over the 
location, p v (a) = Y^xeL(a) P( a )- The peak's bottom 
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FIG. 1: What is a stitch profile diagram? At the top are sketched three alternative DNA conformations at the same temperature. 
In the middle diagrams, the sequence location of each helical region (blue) and each bubble or single-stranded region (red) is 
represented by a stitch. At the bottom, the three "rows of stitches" are merged into a stitch profile diagram. 
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FIG. 2: Example of a ID peak. This peak in p(x) has peak 
volume (yellow area) p v (a) = 1.5 x 1CP 72 , while the peak 
height is Ph(a) = 2.9 x 10 -73 , which is the maximum proba- 
bility attained at f3a = 1209. The peak location L(a) is the 
extent from Xatart = 1204 to x cn d = 1216, which corresponds 
to the local minimum attained at a = 1212. The depth is 
D(a) = 0.711. 



/3a = arg max Ig £( a ) p{x) is the s-value where p attains its 
maximum. (The term "bottom" originates from the cor- 
responding energy landscape picture, but it is the posi- 



tion of the peak's top.) The peak height is Ph(a) = p{fia). 
The peak's depth is D(a) = log 10 P ^U) ■ We a ^ so asso ~ 
ciate a possible peak with each local maximum a G <F, 
namely the spike itself: L(a) — [a, a], p w (a) = 1, /3a = a, 
p v (a) — Ph(a) — p(a), and D(a) = 0. 

While peaks may be high, it is a more defining char- 
acteristic that they are wide. A peak is produced by 
the fluctuations in x, rather than disturbed by them. 
For each local maximum, there are many possible peaks. 
Therefore, a peak can not be identified with its bottom. 
Instead, we use the elements in 'J as unique identifiers of 
peaks. The location of a peak is i(a), not the bottom po- 
sition /3a, and the size of a peak is the peak volume, not 
the peak height. However, for the second type of peaks 
(the maxima), the peak location reduces to the bottom 
and the peak volume reduces to the peak height. 

The set of possible peaks is hierarchically ordered. 
A binary tree is defined by the set inclusion order on 
the set of peak locations. For each pair a, a' € 'F, cither 
L(a) C L(a'), or L(a) 2 L{a'), or they are disjoint. The 
branching corresponds to each local minimum a dividing 
the peak into two subpeaks, see Fig. [2] just as a barrier 
or a watershed or a saddle point divides two valleys or 
lakes in a landscape [S3J 1351 IM| ■ The global minimum is 
the root node p of the tree. The local maxima are the 
leaf nodes of the tree. Each a € <F has at most three 
edges, one towards the root and two away from the root. 
Each a ^ p has an edge towards the root that connects to 
the successor aa. Each successor has an increased depth: 
D(oa) > D(a). And each local minimum a has two edges 
away from the root that connect to two ancestors. The 
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highest peak of the two ancestors is the father ira and 
the other is the mother p,a, i.e., they are distinguished 
by Phiita) > ph(fia). A left-right distinction between the 
two is not used. The notation o~ n a means the successor 
taken n > times, where a°a = a. Each a has a set of 
successors S(a) defined as the path from a to the root: 
a, aa, a 2 a, . . . , p. Each a also has a set of ancestors A(a) 
defined by a' £ A (a) a £ E(o'). The set A (a) is 
the subtree that has a as its root node. A bottom is 
typically shared by several peaks. For example, a peak 
has the same bottom as its father, f3a — flira, but not the 
same as its mother, /3a ^ (3pa. Each a has a paternal line 
n(a), defined as the set of all nodes that share a's bottom. 
n(a) is also the path including a connected by fathers 
that ends at (3a. The beginning of the path, called the 
full node ipa, is either a mother or the root. The paternal 
lines establish a one-to-one correspondence between the 
set of maxima (i.e. bottoms) and the set of mothers 
including the root. 

Having established a hierarchy 'J of possible peaks, the 
second step is to select among them. The selection ap- 
plies two independent criteria, each controlled by an in- 
put parameter: the maximum depth -D max and the prob- 
ability cutoff p c . The first criterion is that a is a ID peak 
according to the following definition. 



Definition 1. Let D r< 



be the maximum depth of peaks. 



Then a £ \& is a ID peak if 



(i) D(a) < D ri 



(ii) D(aa) > D max or a = p. 



The second criterion is that p v (a) > p c . The first cri- 
terion is invoked by using the maxdeep subroutine 
which returns the set P of all ID peaks. The second 
criterion is subsequently invoked by calculating the peak 
volume of each a £ P and comparing with the probability 
cutoff. 



B. Bubbles and helical regions 



The stitch profile algorithm is separate from the statis- 
tical mechanical DNA melting model. The only interface 
to the underlying model is by calling the following prob- 



ability functions: 
Pri g ht(a;) 

Pleft(y) 

Pbubbic(a;,2/) 

Phelix(l,y) 



P(...XXl 0---0 -3'), (1) 

unzipped 

P(5'- 0---0 1XX...), (2) 

unzipped 

P(...XXl 0---01XX. ..), (3) 

bubble 

P(...XX0l ••• 10XX. ..), (4) 



hcli: 



P(...XX01 •••1-3'), 




P(5'-1^__L0XX. ..). 

zipped 



(5) 
(6) 



In these equations, 1 is a bound basepair (helix), is a 
melted basepair (coil) , X is either or 1 , and the sequence 
positions x and/or y are indicated. 

In addition to these, the stitch profile algorithm calls 
methods for adding these probabilites (peak volumes) 
and for computing upper bounds on such probability 
sums. This means that it is easy to change or replace the 
underlying model. In this article, the Poland- Scheraga 
model with Fixman-Freire loop entropies is used [37], but 
in principle, other DNA melting models could be used, 
or even models that include secondary structure [37]. 

This article discusses how to efficiently compute bub- 
ble stitches and helix stitches only. The 5' and 3' tail 
stitches are efficiently computed as in Algorithm 1 |26j . 
Each bubble stitch corresponds to a peak in the bub- 
ble probability function in Eq. ^ . And each helix stitch 
corresponds to a peak in the helix probability function in 
Eq. Q . These two probability functions and their peaks 
are two dimensional, so the ID peak finding method does 
not directly apply. However, the ID peak analysis can 
be performed for each of the other four probability func- 
tions [Eqs. (U, P, and Using Eq. ffl, a binary 
tree and a set of ID peaks P x is computed, and using 
Eq. (pi) , a binary tree ^ y and a set of ID peaks P y is com- 
putecl The probability cutoff is not invoked here. These 
two tree structures with their ID peaks are then further 
processed, as described in the following two sections, to 
obtain the bubble stitches. Likewise, using Eq. (j5|, a bi- 
nary tree ^ x and a set of ID peaks P x is computed, and 
using Eq. (ml, a binary tree and a set of ID peaks P y 
is computed"! These are used similarly to obtain the helix 
stitches. This division of labor also indicates an obvious 
parallelization of the algorithm using two or four pro- 
cessors. Parallelism was not implemented in this study, 
however. 
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C. 2D peaks 



Proposition 2. If (a, b) is nonroot and a-above, then 



The goal of this section is to define 2D peaks and to 
prove the key result that some 2D peaks are simply the 
Cartesian product of two ID peaks. But not all 2D peaks 
have this property, making it a nontrivial result. This is 
expressed in Theorem [2] 

Theorem [2] also indicates a convenient way of com- 
puting all 2D peaks, on which Algorithm 2 is directly 
based. Theorem [2] shows that Algorithm 2's computa- 
tion of stitch profiles is exact, that is, complying strictly 
with the mathematical definition of 2D peaks. The proof 
is therefore important for the validation of Algorithm 2. 
While Theorem [2] is the primary goal, we also prove The- 
orem [l] which similarly provides validation of Algorithm 
1. But more importantly, a comparison of the two theo- 
rems gives more insight in both algorithms. 

A frame is a pair (a, b) E ^f x x \? y . A frame also refers 
to the corresponding box L(a) x L(b) in the xy-pl&ne. A 
frame (a, 6) is contained inside another frame (a',b'), if 
L(a) x L(b) C L(a') x L(b'), that is, if a' E E(a) and b' E 
£(&). The root frame is (p x , p y ). A frame (a, b) is nonroot 
if (a, b) ^ (p x ,p y ). A frame (a, b) is a bottom frame if 
(a, b) — (/3a, /3b) and it is nonbottom if (a, b) ^ (/3a, /3b). 
The depth of a frame (a, b) is D(a, b) = max{Z?(a), D(b)}. 
From this definition, we immediately get 

D(a, b) < D niax <s> D(a) < D max and D(b) < D max . 

To simplify the presentation, we assume that for all 
frames: D(o) ^ D(b). 

Definition 2. The successor of a nonroot frame (a, b) is 

<?{a, 0) - | ff6 j > ora = Px W 

A successor of the root frame does not exist. 

Having defined the depth and the successor, what is 
the depth of a successor? 

Proposition 1. For every nonroot (a,b), D(a(a,b)) > 
D(a,b). 

Proof. For a(a, b) — (era, 6), ma,x{ D (a a), D(b)} > 
ma,x{ D (a), D(b)} because D(aa) > D(a). Likewise for 
a(a, b) = (a, o~b). □ 

Definition 3. A frame (a, b) is a-above if 

(i) D(o~a) > D(b) or a = p x . 

(ii) D(ab) > D(a) or b = p y . 

The term "cr-above" is a mnemonic for the two in- 
equalities in the definition. The set of all frames that are 
cr-above is called the frame tree. While Prop, [l] only sets 
a lower bound on the depth of a successor, we can write 
the actual value for cr-above frames: 



D(a(a,b)) 



D(aa) ifo~(a,b) = (aa,b) 
D(ab) if cr(a,6) = (a, o~b). 



(9) 



Furthermore, D(a(a,b)) = min{D(o~a), D(crb)} if both 
a 7^ Px and b ^ p y . 

Proof. If cr(a, b) = (era, 6), then a ^ p x and 
max{D(cra), D(b)} = D(aa) by Def. [3] If, furthermore, 
b ^ p y , then D(a(a,b)) = D(aa) < D(crb) by Def. [2] 
Likewise if cr(a, b) = (a, o~b). □ 

By repeatedly taking the successor, we eventually 
end up at the root frame in, say, R steps. S(a, b) is 
the sequence of successors of (a, b), i.e., the sequence 
{cr n (a, b)}$ that begins at (a, b) and ends at the root 
frame. Alternatively, S(a, b) is defined as the set of suc- 
cessors, i.e., the set of such sequence elements. What 
if we want to exclude (a, b) from S(a, &)? That can be 
written as E(cr(a, &)). 

If (a, b) is not a-above, then its sequence of succes- 
sors takes the shortest path to a a-above frame, or put 
another way: 

Proposition 3. If a' E S(a), b' E S(6) and (a',b') is 
a-above, then (a', 6') E S(a, b). 

Proof. All elements in both S(a) and S(&) are visited by 
the sequence E(a,6) on its climb to the root frame. As- 
sume (a', b') ^ S(a, b). Then either a' is passed before b' 
is reached, or viceversa, and we can assume that a' comes 
first. In other words, a' ^ p x and there is a b" =^ b' such 
that b' E £(&") and a(a',b") = (aa',b"). Then D(b') > 
D(ab"). ByDef.[2) we see that D(ab") > D(aa'). (a',b') 
is a-above, so by Def. [3] we see that D(aa') > D(b'). We 
arrive at the contradiction D(b') > D(b'). □ 

Each frame is the successor of at most four frames. 
If (a, b) — a(a',b') then (a',b') is either (ira, b), (a,wb), 
(pa,b), or (a, p,b). Two of these are defined as ancestors: 

Definition 4. The father of a nonbottom frame (a, b) is 



7r(a, b) 



(Tta,b) ifD{a) > D(b) 
(a,nb) ifD(a) < D(b) 



The mother of a nonbottom frame (a, b) is 
(pa,b) ifD(a) > D(b) 



(10) 



^ a ^ = \{a,pb) l fD{a)<D{b). < U ) 

Fathers and mothers of bottom frames do not exist. 

Each father or mother can have its own father and 
mother, and so on. The set of ancestors A(a, b) is the 
binary subtree defined recursively by: (1) (a, b) E A(a, b). 
(2) If nonbottom (a', b') E A(a, b) then n(a' , b') E A(a, b) 
and p(a',b') E A(a,b). 

The next proposition shows that being a-above is prop- 
agated by a, 7r, and p,: 
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Proposition 4. Let (a, b) be a-above. 

(i) If (a' , b') £ S(a, b) then (a', b') is a-above. 
(ii) If(a',b') £ A(a, b) then (a', b') is a-above. 

Proof, (i): First, we show that a(a,b) is a-above: If 
a(a,b) = (era, 6), then Def. [2] implies the second con- 
dition: D(ab) > D(aa) or 6 = p y . And (a, 6) is er-above 
which by Def. [3] implies the first condition: D(a 2 a) > 
D{aa) > D(b) or era = p x . Similarly, a(a, b) — (a,ab) is 
shown to be er-above. The proof is completed by induc- 
tion. 

(ii) : First, we show that 7r(a, b) is er-above: If 
7r(a, b) — (na,b), then Eq. (To| implies the first condi- 
tion: D(aira) — D(a) > D(b) or 7ra = p x . And (a, 6) 
is er-above which by Def. [3] implies the second condi- 
tion: D(ab) > D(a) > D(ira) or b — p y . Similarly, 
7r(a, b) = (a, irb) and p(a, 6) are shown to be er-above. 
The proof is completed by induction. □ 

Successors are the inverse of fathers and/or mothers 
for er-above frames only: 

Proposition 5. // (a, b) is nonbottom and nonroot, the 
following statements are equivalent: 

(i) (a, b) is a-above 

(ii) air(a, b) = (a, 6) 

(Hi) ap(a,b) = (a, 6) 

(iv) ira(a, b) — (a, 6) or pa(a, b) = (a, b) 



Proof, (i) <^> (ii): If 7r(a, 6) = (it a, b), then Eq. (10 1 im- 
plies the first condition that (a, b) is er-above: D(aa) > 

D(a) > D(b) or a = p x . Then (a, b) is a-above ^I^P 

D(ab) > D(a) = D(aira) or b = p y er(7ra, 6) = 

(aira,b) a7r(a,6) = (a, b). If 7r(a, 6) = (a, irb), the 
equivalence is shown similarly. 

(i) (Hi): Replace ir by p in the above. 

(i) (iv): If a(a,b) = (era, b), then Def. [2] implies 
the second condition that (a, b) is a-above: D(ab) > 

D(aa) > D(a) or b = p y . Then (a,b) is a-above D '' 



D(aa) > D(b) 7r(aa, b) = (iraa 7 b) or p(aa,b) = 

(paa,b) ■<=>• ira(a,b) = (a, b) or pa(a,b) = (a, b). If 
a(a, b) = (a, ab), the equivalence is shown similarly. □ 

Accordingly, there is an "inverse" relationship between 
the sets of successors and ancestors: 

Proposition 6. (a',b') is a-above and (a,b) £ E(a',b') 
iff (a, b) is a-above and (a',b') G A(a, b). 

Proof, (a, b) € S(a', b') implies a path of successors from 
(a',b') to (a, b). Prop. [4] shows that all elements in the 
path are a-above. Prop. |5[iv) applied to each step in the 
path gives an opposite path of ancestors. 



Conversely, (a', b') € A(a, b) implies a path of ances- 
tors from (a, b) to (a', b'). Prop.Hshows that all elements 
in the path are a-above. Prop.]5lii) and (iii) applied to 
each step in the path gives an opposite path of succes- 
sors. □ 

It follows from Prop. [6] that the frame tree is equal to 
the binary tree A(p x ,p y ), because (p x ,p y ) £ T,(a',b') for 
any (a',b'). It has the same pedigree properties as 'J, 
such as paternal lines and f3n(a,b) = (3(a,b). 

So far, we have covered ground that was already im- 
plicit in |26j . but augmented here with proofs. The next 
concept is new, however, namely the Cartesian products 
of ID peaks. 

Definition 5. (a, b) is a grid frame if a and b are ID 
peaks. 

The set of all grid frames is G = P x x P y . As Fig 
shows, G has a grid-like ordering in the xy-plane. A 
ID peaks a £ P x have disjoint peak locations L(a) — 
[z s tart(a), Zend (a)]- They can be indexed by i = 1, 2, 3, . . . 
according to their ordering from 5' to 3' on the sequence, 
such that x en d(ai) < x s tart(oi+i)- Likewise, the ID peaks 
b £ P y can be indexed by j. Then the grid frames form 
a matrix G with elements [G]ij = (a,i,bj). We use the 
symbol G for both the set and the matrix. 

Proposition 7. Every grid frame (a, b) is a-above. 

Proof. If a ^ p x , then D(aa) > -D max because a is a 
ID peak and D max > D(b) because b is a ID peak 
(see Def. [T]), thus showing Def. [3](i) . Similarly, we show 
Dcf.^ii). □ 

The following two lemmas show that grid frames in- 
herit some properties from ID peaks. 

Lemma 1. (a, b) is a grid frame iff 

(i) (a, b) is a-above, 

(ii) D(a,b) < D m 

(iii) D(a(a,b)) > D max or (a, b) is the root frame. 

Proof. If (a, b) is a grid frame, then it is a-above by 
Prop.[7]and Eq. ^ implies D(a, b) < D m&x . For nonroot 
(a, b), D(a(a, b)) equals either D(aa) or D(ab) (Prop.[2j, 
which is > -Dmax because a and b are ID peaks. 

Conversely, Eq. Q implies D(a) < D max . For a = p x , 
a is then a ID peak. For a ^ p x , Prop. [2] gives D(aa) > 
D(a(a,b)) > -D max , so a is a ID peak. Similarly, b is 
shown to be a ID peak. □ 



Lemma 2. Let D,, 



be the maximum depth of peaks. 



(i) For each a with D(a) < D max , there is exactly one 
ID peak a' £ S(a). 

(ii) For each (a, b) with D(a, b) < D rnax , there is exactly 
one grid frame (a',b') £ S(a, b). 
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FIG. 3: The set G = P x x P y of all grid frames plotted in the xy-plane. The grid frames are colored to distinguish those that 
are above the diagonal (green), crossing the diagonal (red), and below the diagonal (grey), thus illustrating the subsets G a , G c 
and Gb, respectively. Frames with side lengths below 20 bp are not shown to unclutter the figure. 



Proof. 0: The depth increases monotonically in the se- 
quence E(a) of successors ( Vn : D(a n a) < D(a n+1 a)). 
For D(p x ) > .D max , there is therefore a unique element 
a' 7^ p x with D(a') < D m&x and D(oa!) > -D max - For 
D(p-x) < -Dmax, a! — p x is a ID peak and no other ele- 
ment in S(a) can fulfill Dcf. [TJii). 

@: Eq_^]) gives D(a) < D max and D(6) < D roax . By 
applying (II I to a and we obtain a unique grid frame 
(a',b') where a 1 £ £(a) and b' E S(a). (a',b') is cr-above 
by Prop. [7j so (a', 6') S E(a, 6) by Prop. |3] □ 

How do we define 2D peaks? A straightforward way 
would be to generalize ID peaks by simply rewriting 
Def. [l] in the frame tree context. The result would be 
the grid frames, as we see by Lemma [T| However, there 
is more to the picture than the frame tree, due to a fur- 
ther constraint to be discussed next, which requires a 
more elaborate definition of 2D peaks. 

In genomic annotations, a region is specified by coor- 
dinates x..y, where by convention x < y, i.e., x is the 5' 
end and y is the 3' end. We adopt the same constraint 
for our notation (x,y) of the instantaneous location of 
a bubble or helix. In the aiy-plane, helices are only de- 
fined for (x,y) above the diagonal line y = x. Bubbles 
have at least one melted basepair in between x and y, so 



they are only defined for (%, y) above the diagonal line 
y = x + 1. Accordingly, we require that frames are above 
the diagonal line, as defined in the following. 

Definition 6. A frame (a, b) is above the diagonal if 
Xend(a) + 1 < y s tart(b) for bubbles, (12a) 

Xend(a) < y st artib) for helices. (12b) 
A frame (a, b) is below the diagonal if 

Xstart(a) + 1 > y en d(b) for bubbles, (13a) 



x s tart(a) > y en d(b) for helices. 



(13b) 



A frame (a, b) is crossing the diagonal if it is neither 
above the diagonal nor below the diagonal. 

Note: A frame that is crossing the diagonal contains 
at least one point (x, y) above the diagonal line, while a 
frame that is below the diagonal contains no points above 
the diagonal line, but its upper left corner may be on the 
diagonal line. Figure [3] illustrates frames that are above, 
crossing and below the diagonal. 
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The requirement that a frame is above the diagonal 
puts a constraint on its size. This is embodied in the 
next concept. 

Definition 7. The root frame is a fractal frame if it is 
above the diagonal. A nonroot frame (a, 6) is a fractal 
frame if 

(i) (a,b) is above the diagonal, 

(ii) a(a,b) is crossing the diagonal, 

(Hi) (a,b) is a-above. 

The set of all fractal frames is denoted F. As Fig. [4] 
shows, fractal frames tend to be smaller the closer they 
are to the diagonal, thus resembling a fractal. For a typ- 
ical fractal frame, the fluctuations in x and y are com- 
parable in size to the length y — x of the bubble or helix 
itself. Indeed, the two peak locations L(a) and L(b) are 
as wide as possible, while not overlapping each other (be- 
cause the successor is crossing the diagonal). In contrast, 
the fluctuations for grid frames are relatively small on av- 
erage and independent of the bubble or helix length. 

Lemma 3. For each a-above and above the diagonal 
(a, b), there is exactly one fractal frame (a', b') € S(a, 6). 

Proof. Let (a', b') = cr n (a, b), where n is the largest num- 
ber for which a n (a, b) is above the diagonal, (a' , b') is a- 
above by Prop.]!] For all m > n, frames a m (a, 6) (if they 
exist) are not above the diagonal, nor below the diago- 
nal because they contain (a, b), hence they are crossing 
the diagonal. Therefore (a', b') is a fractal frame. For all 
m < n, frames o~ m (a, b) (if they exist) are above the di- 
agonal, because they are contained in (a',b'). Therefore 
(a', b') is the only fractal frame in E(a, b). □ 

Lemma [3] is similar to Lemma [2] By Prop. [6] we 
can express both lemmas in terms of ancestors A in- 
stead of successors X. The lemmas then say that cer- 
tain kinds of frames are organized as forests. A forest is 
a set of disjoint trees. The sets F and G generate two 
forests: [J, 6 x eG A(a, b) consists of the subtrees having 
grid frames as root nodes. (J^ a b ^ eF A(a, b) consists of the 
subtrees having fractal frames as root nodes. By these 
forests, we generate from G the set of all u-above frames 
with D(a,b) < -D max , and we generate from F the set of 
all a-above frames above the diagonal. 

All the necessary concepts are now in place for the defi- 
nition of 2D peaks. We will not repeat the "derivation" of 
2D peaks given in [25], but just recall that 2D peaks are 
defined with a purpose: They must capture the extent of 
the actual peaks in the probability functions Pbubbie^, y) 
and Pheiix(#, y). And they must have an interpretation in 
terms of fluctuations on a given timcscalc. The following 
definition is equivalent to the formulation in [26 . 

Definition 8. Let D rnax be the maximum depth of peaks. 
A frame (a, b) is a 2D peak if 



(i) (a, b) is above the diagonal, 
(ii) (a, 6) is a-above, 
(Hi) D(a,b) < D max , 

(iv) D(a(a,b)) > D max or (a, b) is a fractal frame. 

Note: the or in the definition is not an exclusive or. 
A 2D peak (a, b) can both be a fractal frame and have 
D(a(a, b)) > D max . The set of all 2D peaks is denoted P 
and is illustrated in Fig. [5] 

Comparing Def. [8]and Lemma[l] we see that the differ- 
ence between 2D peaks and grid frames is due to the di- 
agonal constraint: First, the requirement that 2D peaks 
are above the diagonal, and second, the possible exemp- 
tion from the second inequality, which for grid frames is 
being the root frame, while for 2D peaks it is being a 
fractal frame. Unlike grid frames, 2D peaks can capture 
events close to the diagonal by adapting their size. 

Computing the 2D peaks is at the core of the stitch 
profile methodology. The following two theorems provide 
characterizations of 2D peaks that may be translated into 
computer programs. 

Theorem 1. We divide 2D peaks into two types, being 
fractal frames or not, that can be distinctly characterized 
as follows. 

(i) (a, b) is a 2D peak and a fractal frame iff (a, b) is a 
fractal frame and D{a,b) < D max . 

(ii) (a, b) is a 2D peak and not a fractal frame iff (a, b) 
is a grid frame and there is a fractal frame {a! ,b') 
with D(a' ,b') > D max , such that (a', b') £ S(a, b). 

Proof. (JT|): Immediate by Defs. [7] and [8] 

((2): If a 2D peak (a, b) is not a fractal frame, then 
D(a(a, b)) > £> max by Def. [8] so (a, b) is a grid frame by 
Lemma [T| Applying Lemma [3] there is a fractal frame 
(a',b') G E(a, b). (a, 6) ^ (a',b') because one is a fractal 
frame, the other is not, so (a',b') € E(tr(a, &)), which by 
Prop. [I] implies D(a',b') > D mgcK . 

Conversely, (a, b) is above the diagonal because it is 
contained in a fractal frame, (a, b) ^ (a', b') because 
D(a,b) < D max and D(a',b') > L> max , implying that 
(a, b) is not a fractal frame (uniqueness by Lemma [3]) 
and not the root frame. The other requirements for a 2D 
peak are established by Lemma [T] □ 

Theorem [T] characterizes all 2D peaks by their rela- 
tionship to fractal frames. This is applied in Algorithm 
1, that derives all 2D peaks from fractal frames. How- 
ever, the next theorem shows that some 2D peaks can be 
characterized without referring to fractal frames. 

Theorem 2. A nonroot 2D peak has a successor, the 
depth of which is either greater or less than D max . We 
thus divide 2D peaks into two types, that can be distinctly 
characterized as follows. Let (a, b) be nonroot. Then 
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FIG. 4: The set F of all fractal frames plotted in the sj/-plane. The fractal frames (a, b) £ F are colored to distinguish those 
with depths D(a,b) > -Dmax (grey) and D(a,b) < D m ax (blue), thus illustrating the subsets and F s , respectively. Frames 
with side lengths below 20 bp are not shown to unclutter the figure. 



(i) (a, b) is a 2D peak and D(a(a, b)) > D rnax iff (a, b) 
is a grid frame that is above the diagonal. 

(ii) (a, b) is a 2D peak and D(o~(a, b)) < D max iff (a, b) 
is a fractal frame and there is a grid frame {a! ,b') 
that is crossing the diagonal, such that {a! ,b') 6 
E(o,6). 

Proof, ([lj: Immediate by Def. [8] and Lemma [T| 

pi: If a 2D peak (a,b) has D(a(a,b)) < -D max , then 
(a, b) is a fractal frame by Def. [8] Applying Lemma [2] 
to a(a,b), there is a grid frame {a', b') £ £(cr(a, b)) C 
£(a, b). Frame (a',b') is crossing the diagonal because it 
contains a(a,b), which is crossing the diagonal because 
(a, b) is a fractal frame. 

Conversely, (a, b) ^ (a',b') because (a, b) is above the 
diagonal (a fractal frame) and (a',b') is crossing the di- 
agonal, and hence (a',b') e T,(a(a,b)). Since (a',b') is a 
grid frame, Lemma [l] gives D(a',b') < D max , which by 
Prop, [l] implies D(a,b) < D(a(a,b)) < D max , and we 
conclude that (a, b) is a 2D peak. □ 

Note: Theorem |2] does not consider the root frame. 
However, if the root frame is a 2D peak, then it is of the 
first type: a grid frame that is above the diagonal. 



It follows from Theorems [T] and [2] that a 2D peak is 
either a grid frame, a fractal frame, or both. The set 
of 2D peaks P can therefore be divided into three dis- 
joint sets defined as follows. Pf are the 2D peaks that 
are fractal frames only, not grid frames. Pfg are the 2D 
peaks that are both fractal frames and grid frames. Pq 
are the 2D peaks that are grid frames only, not fractal 
frames. Let G a , Gb and G c be the sets of grid frames that 
are above, below and crossing the diagonal, respectively. 
Let Fd and F s be the sets of fractal frames that are deep 
(D(a,b) > -D max ) and shallow (D(a,b) < -Dmax), respec- 
tively. In Figs. [3] [5j all these subsets are illustrated with 
different colors. The following corollary summarizes the 
relationships between grid frames, fractal frames and 2D 
peaks: 



Corollary 1. The set of 2D peaks is P = F S U G a . The 

intersection between the grid and the fractal is Pfg — 
F s n G a — F n G. Furthermore, the 2D peaks can be 
obtained by the following two expressions, in which all 
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FIG. 5: The set P of all 2D peaks plotted in the xy-p\ane. The 2D peak frames are colored to distinguish those that are 
fractal frames (blue), fractal frames and grid frames (black), or grid frames (green), thus illustrating the subsets Pp, Pfg and 
Pg, respectively. Frames with side lengths below 20 bp are not shown to unclutter the figure. 



set unions are between disjoint sets: 

P = F s (J GnA{a',b'), (14a) 

(a',b')eF d 

= G a [J FnA(a',b'). (14b) 

(a'b')EG c 



Proof. P = P F U Pfg U P g . Theorem [T] states that P F U 
Pfg = F s and that 

P G = (J GnA( fl ',6')- 

(a',V)eF A 

Here, A(a' ,b') is brought into play by Prop. [6] Theorem 
[2] states that Pfg U Pg = G a (the root frame would go 
here) and that 



P P = (J PnA(a',6')- 

(a',6')eG c 



□ 



Eqs. (14a I and (14b I outline how the set of 2D peaks 



is built up computationally by Algorithm 1 and 2, re- 
spectively. Writing the expressions side by side shows 
the parallels: Algorithm 1 takes some fractal frames and 



then it adds some grid frames that are contained in- 
side fractal frames. Algorithm 2 takes some grid frames 
and then it adds some fractal frames that are contained 
inside grid frames. In both cases, the additional part 
is the more complicated part, as it requires searching 
some forests. The two Algorithms are algorithmically 
equivalent in terms of output, but the transformation in 



Eq. (14) from P-based to G-based facilitates a reduction 



in execution time, as described in the next section. 



D. The fast and exact algorithm 

Algorithm 2 owes its speed to two important ingredi- 
ents: One is the grid frame matrix G associated to the 
parameter P m ax • The other is an upper bound associated 
to the parameter p c . 

To compute all bubble stitches of the stitch profile, the 
algorithm must find those 2D peaks (a, b) in the bubble 
context that have a peak volume 

p v (a,b)= ^2 ^2 PbubbicO,y) (15) 

x£L(a) y£L(b) 

that is greater or equal to the probability cutoff p c . Ac- 
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cording to Eq. (14b), one can write an algorithm for ob- 



taining all 2D peaks using two nested loops that goes 
through all matrix elements (aj, bj) of the grid frame ma- 
trix G: If (a,i,bj) is above the diagonal, it is a 2D peak. 
If (dj, bj) is crossing the diagonal, a subroutine computes 
the set F D A(ai,bj). If (cii,bj) is below the diagonal, 
it is skipped. By piping the resulting frames through a 
probability cutoff filter, we obtain the bubble stitches. 

The matrix G is not stored in memory, only the two 
arrays P x and P y that provide each a, and bj. Matrix 
elements (di,bj) being above, crossing or below the diag- 
onal refers to the diagonal line in the xy-pleme, never the 
diagonal of the matrix. For each row and column of the 
matrix there may be zero, one, or more matrix elements 
that are crossing the diagonal, as can be seen in Fig. [3] 

More specifically, let G be of order m x n and let the 
outer loop be over j = n to 1 and the inner loop over 
i = m to 1. The iteration thus begins at the upper right 
corner of Fig. [3] and steps along the y-axis in the outer 
loop and the x-axis in the inner loop. However, we do 
not have to start at i = m for each j. If (ai, bj) is below 
the diagonal, then (0^,6^) is below the diagonal for all 
k < j. Therefore, we can jump directly to the i that 
corresponds to the first grid frame that was not below 
the diagonal at the previous j. In this way, most of the 
grid frames that are below the diagonal are ignored by 
the algorithm. While this is a trivial programming trick, 
we shall now see a less trivial trick, that ignores most of 
the grid frames that are above the diagonal. 

Recall [27] that the bubble probability is 



Pbubbie(a, b) 



Z xi0 (x)n(y - x)Z 01x (y) 



(16) 



The loop entropy factor S7(y — x) is a monotonically 
decreasing function. Its largest value in a frame (a, b) 
is therefore in the lower right corner, i.e. f2 m ax = 
fi(y s tart(&) - Zend (a)). Then 



p v (a, b) < 



z xw(x) Z Qlx (y) 



r6L(a) 



K y£L(b) 



and the bubble peak volume has an upper bound that 
factorizes. Using the ID peak volumes 



Pv(a) = Y Z xw{x)/Z, 

x£L(a) 

Pv(b) = Y z oix(y)/z, 

yeUb) 



(17a) 
(17b) 



we can write the upper bound as 

Pv(a,b) = n(2/start(&) ~ x cnd (a))Zp v (a)p v (b). (18) 

If a grid frame [a^bj) has an upper bound below the 
cutoff, p v (a,i,bj) < p c , then also p v (ak,bj) < p c for all 
k < i for which p v (a,k) < P-vifli), because the loop entropy 
factor is decreasing. In that case, their peak volumes are 



also below the cutoff, of course, and the algorithm can 
reject all these frames. 

We implement this observation by calculating in ad- 
vance the next bigger goat nbg(z) defined by 

(i) p v (a k ) < Pv(oj) for nbg(i) < k < i 

(ii) p v (a nhg{l) ) > p v (di) 

The nbg(z) is calculated as follows: A loop over i = 1 
to m compares each p v (cii) successively to p v (a.;_i), 
Pv(a n b g (i-i)), Pv(a„b g (nbg(i-i))), • • • until a bigger one is 
found or the list ends. 

For grid frames (a,i,bj) that are above the diagonal, 
the algorithm first checks if pyifli, bj) < p c , in which case 
it jumps directly to (a n bg(i) j &j ■ The nbg(i) may be un- 
defined, if there are no bigger p v {a k ), in which case the 
inner loop is done and the outer loop proceeds to the next 
j. On the other hand, if p v (a>i, bj) > p c , then the peak 
volume has to be calculated and checked. Although grid 
frames may be skipped without having calculated neither 
their peak volumes nor their upper bounds, the criterion 
for rejection is exact. There are no false negatives (or 
positives). 

For each grid frame (ai,bj) that is crossing the di- 
agonal, the algorithm calculates a set of 2D peaks, 
F n A(aj, bj), and checks the peak volume of each. This 
set consists of all fractal frames that are contained inside 
(a,i,bj). A mental picture is that (otj, bj) must be broken 
into fractal frames (fractured) to avoid crossing the diag- 
onal. The algorithm searches the subtree A(a,i,bj) top- 
down (breadth- first) with a recursive subroutine. A given 
input frame (a, b) is split into its father frame 7r(a, b) and 
mother frame fi(a,b). Each in turn is then checked as 
follows: If it is crossing the diagonal, it is further split 
by giving it recursively as input to the subroutine. If in- 
stead it is above the diagonal, it is a fractal frame. With 
(di,bj) as input, the subroutine finds F (1 A(ai,bj). (If 
instead the input is the root frame (p x ,p y ), the subrou- 
tine will find all fractal frames F. This was applied in 
Algorithm 1.) 

Figure [6] shows the resulting search process, by plot- 
ting only frames that are processed by the algorithm, 
while the ignored grid frames are blank. Comparing with 
Fig. [3] we see that the blank areas correspond to the great 
bulk of grid frames both above and below the diagonal, 
leaving just an irregular band of frames along the diago- 
nal to be searched. This is a nice geometric illustration 
of the reduction from 0(N 2 ) to 0(N log N) in execu- 
tion time. Figure [6] also shows that some bubble stitches 
are fractal frames contained inside grid frames that are 
crossing the diagonal. 

The peak volumes p v (a, b) of some frames must be cal- 
culated. Algorithm 2 spends a considerable fraction of its 
time on doing these summations. The summation over 
a bubble frame can be done faster if the frame is big 
enough, by exploiting the Fixman-Freire approximation 
d la Yeramian |29[ 138] . This does not improve the time 
complexity, but significantly reduces the total execution 
time by some factor. 
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FIG. 6: The footprints of Algorithm 2 plotted in the :rj/-plane. These are frames that are visited by Algorithm 2 during its 
search for the bubble stitches (filled yellow). The frames are located in a band along the diagonal, suggesting that the search 
space is proportional to sequence length. Grid frames below the diagonal (grey) are skipped. Grid frames crossing the diagonal 
(red) are broken into fractal frames (blue). The bubble stitches (filled yellow) are those grid frames above the diagonal (green) 
and fractal frames (blue) that have p v {a, b) > p c - Frames with side lengths below 20 bp are not shown to unclutter the figure. 



To compute all helix stitches of the stitch profile, 
the algorithm follows exactly the same proced ure a s de- 
scribed above, but in the helix context. Eq. (14b) and 



the analysis in the previous section applies equally well to 
the bubble and the helix contexts. The various quantities 
are, of course, replaced by their helix counterparts. For 
example, the appropriate diagonal line is applied (Def|6]). 
The main difference is the upper bound on helix peak 
volume. Since x and y decouples in the helix probability 



Phelix(z,2/) 



Phelix(z, AQPhelbc(l,i/) 
p h elix(l, -W) 



(19) 



we can simply use the peak volume as its own upper 
bound: 



p v (a,b) = p v (a,b) 



p v (a)p v (b) 



(20) 



Phelix(l,-/V) 

The 3 (a;, y) factor [53] is the counterpart of Q(y — x), but 
an explicit consideration of its monotonicity is not neces- 
sary here, because it is absorbed in the above quantities. 
A next bigger goat is then calculated and applied in the 
same way as for bubbles. 



III. RESULTS AND DISCUSSION 
A. Time complexity 

By inspection of Algorithm 2, we observe that it visits 
at least O(N) and at most 0(N 2 ) matrix elements of G. 
Furthermore, it performs sorting, which is known to scale 
as O(NlogN). The time complexity is therefore between 
0(N log N) and 0(N 2 ). The execution time depends on 
the fraction of ignored grid frames above the diagonal, 
which depends on the specific sequence, temperature, and 
other input parameters. A theoretical analysis of these 
dependencies is complicated. 

Empirical testing of the execution times were done in- 
stead, using a test set of 14 biological sequences with 
lengths selected to be evenly spread on a log scale span- 
ning three decades. A minimum length of 1000 bp was re- 
quired. Most of the test sequences are genomic sequences, 
so as to represent the typical usage of the algorithm. The 
sequence lengths and accession numbers are: 

• 1168 bp [GenBank:BC108918] 
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FIG. 7: Algorithm 1 is quadratic and Algorithm 2 is linear. 
The log-log plot shows the execution time versus sequence 
length of Algorithm 1 (red) and Algorithm 2 (blue). The 
straight lines are fits to the data points with slopes 1.97955 ± 
0.02923 (red) and 0.99953 ± 0.02016 (blue). 

• 1986 bp [GcnBank:BC126294] 

• 4781 bp [GenBank:BC039060] 

• 7904 bp [GenBank:NC_001526] 

• 16571 bp [GenBank:NC_001807] 

• 36001 bp [GenBank:AC_000017] 

• 48502 bp [GenBank:NC_001416] 

• 85779 bp [GenBank:NC_001224] 

• 168903 bp [GenBank:NC_000866] 

• 235645 bp [GenBank:NC_006273] 

• 412348 bp [GenBank:AE001825] 

• 816394 bp [GcnBank:NC_000912] 

• 1138011 bp [GenBank:AE000520] 

• 2030921 bp [GenBank:NC_004350] 

The algorithms were written in Perl and run on a Pen- 
tium 4, 2.4 GHz, 512 KB cache, 1 GB memory, PC with 
Linux (CentOS). In Fig. [7] the speeds of Algorithms 1 
and 2 are compared. Algorithm 2 is orders of magnitude 
faster than Algorithm 1 for sequences longer than 100 
kbp. While all the 14 sequences were computed by Al- 
gorithm 2, the three longest sequences were aborted by 
Algorithm 1, because of too long execution times. To en- 
sure that the computational tasks were comparable, all 
sequences were computed at their melting temperatures 
T m , rather than one temperature for all, such that all 
sequences had the same fractions of helical regions and 
bubbles. For both algorithms, straight lines were fitted 



FIG. 8: Algorithm 2 is fast at all temperatures. The total 
execution times are plotted versus sequence length for each of 
the listed helicity values. 



to the data in the log-log plot. For Algorithm 2, however, 
the longest sequence (2 Mbp) is considered an outlier and 
thus excluded from the fit. This sequence's execution 
time was overly increased, because the required memory 
exceeded the available 1 gigabyte RAM. For Algorithm 1, 
the slope of the fit is 1.97955±0. 02923, suggesting that it 
has time complexity 0(N 2 ). For Algorithm 2, the slope 
of the fit is 0.99953 ± 0.02016. This is interpreted as 
the time complexity 0(N log N), but with the logarith- 
mic component being too weak to distinguish 0(N log N) 
from O(N). 

The execution time of Algorithm 2 is just as much a 
property of the underlying energy landscape depending 
on the input, as it is a property of the algorithm. Could 
it be that other input parameters and/or sequences than 
was used in Fig. [7] — say, away from the melting points — 
would exhibit the time complexity 0(N 2 )? Figure M 
shows the speed of Algorithm 2 over the whole melting 
range of temperatures. Each sequence in the test set was 
computed at temperatures corresponding to the helicity 
values: 0.9995, 0.999, 0.995, 0.99, 0.95, 0.9, 0.8, 0.7,..., 
0.2, 0.1, 0.05, 0.01, 0.005, 0.001, and 0.0005. This helic- 
ity range approximately corresponds to the temperature 
range T m ± 10°C and it covers most of the melting tran- 
sitions. Although the curves for the individual helicity 
values may not be easily distinguished in Fig. [8] it ap- 
pears that all curves have similar slopes and that they 
are close to each other, i.e., the variation in execution 
time is below 50%. This indicates that the helicity (or 
temperature) value has only a small influence on the total 
execution time. The time complexity O(NlogN) seems 
to be robust. 

However, a stronger temperature dependence is re- 
vealed when considering the computations of bubble 
stitches and helix stitches separately. Two independent 
subroutines of Algorithm 2 compute the bubble stitches 




T(°C) 



FIG. 9: Bubble and helix execution times versus temper- 
ature. For the sequence [GenBank:NC_001807], the bubble 
time (red) and helix time (green) divided by sequence length 
(16571 bp) is plotted versus T. The melting curve (blue) 
shows the helicity O (on the right vertical axis) as a function of 
T, indicating the melting midpoint: = 0.5 at T m = 83.7°C. 
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FIG. 10: Sequence-averaged bubble and helix execution 
times. The bubble time per basepair (red) and helix time 
per basepair (green) averaged over all sequences are plotted 
versus the normalized temperature r for each of the helic- 
ity values listed in Fig. [8] The melting curve (blue) shows 
the helicity (on the right vertical axis) as a function of r, 
indicating the melting midpoint: = O.5atr = O. 



defined such that the melting curve becomes a sigmoid: 



and the helix stitches, both following the procedure out- 
lined in the previous section. The rest of Algorithm 2's 
computation, including the initial computation of at least 
four partition function arrays |27j . is called the over- 
head. Correspondingly, the total execution time itotai 
is the sum of the bubble execution time ^bubble, the helix 
execution time ihciix, and the overhead execution time 
^overhead- By simply switching off the bubble subrou- 
tine (i.e. ^bubble = 0) and measuring the total execution 
time, we obtain ihciix + ^overhead- Likewise, by switching 
off the helix subroutine, we measure ^bubble + £ overhead- 
In the following, we refer to tbubbic + ^overhead as the bub- 
ble time and ihciix + ^overhead as the helix time. As an 
example, Fig. [9]shows the results for the 16571 bp [Gen- 
Bank:NC_001807]. The bubble and helix times are di- 
vided by sequence length and plotted as a function of 
temperature. Both of them have clearly a strong tem- 
perature dependence. The melting curve is also plotted 
in Fig. [9] indicating that most of the melting occurs in 
the temperature range 80-85°C. Plots like Fig. [9] were 
made for each sequence in the test set, but the average 
behavior is more interesting. To average times of the or- 
der O(N) over sequences of different lengths, one should 
divide them by sequence length as in Fig. [9] However, to 
plot as a function of temperature would not be meaning- 
ful, because the sequences have different T m 's and differ- 
ent melting ranges. On the horizontal axis, instead, we 
use a normalized temperature, 
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For each r- value (or equivalently for each O- value), the 
bubble times and helix times divided by sequence length 
averaged over all sequences are plotted in Fig. [10] The 
curves have a similar temperature dependence as in 
Fig. [9j The helix time decreases monotonically (except 
for a shoulder), while the bubble time increases mono- 
tonically (except for a shoulder) . Both of them have an 
about four-fold difference between their maximum and 
minimum. Qualitatively, the curves are kind of mir- 
ror symmetric, but the helix time is generally greater 
than the bubble time, the two curves cross each other at 
= 0.12. It seems that adding the two curves would give 
a more or less horizontal curve, i.e., the total execution 
time has much less temperature dependence. 

We may understand this interchange between bubble 
time and helix time in terms of the melting process. If 
we assume that the bubble time is proportional to the 
area of the footprint in Fig. [6j and that this is propor- 
tional to the average length of potential bubbles at that 
temperature, then we would expect the bubble time to in- 
crease with temperature, because bubbles grow as DNA 
melts. Likewise, we would expect the helix time to de- 
crease with temperature, because helical regions diminish 
as DNA melts. 

In this article, Blake & Delcourt's parameter set [39] as 
modified by Blossey & Carlon [40] was used with [Na + ] = 
0.075 M. The maximum depth and probability cutoff 
parameters were D max — 5 and p c = 0.01 in Figs. pM)] 



(21) -Dmax = 3 and p c = 0.02 in Fig.|7J and £> max = 3 and p, 
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0.0001 in Figs. |] [To] The sequence [GenBank:BC039060] 
was used for producing Figs. [2}|6] A systematic test of 
how the execution time depends on -D max and p c has not 
been performed. 



B. Discussion 

For an algorithm to be called efficient, it should solve 
the task at hand with optimal time complexity. It should 
not introduce approximations, that would just amount to 
a reformulation of a simpler, but different task. In this 
study, the task is to compute a stitch profile based on 
the Poland-Scheraga model with Fixman-Freire loop en- 
tropies. With this model, the time complexity must be 
at least 0(N log N). Indeed, this is achieved by Algo- 
rithm 2 under a wide range of conditions. Algorithm 2 
does not acquire a speedup by any windowing approxi- 
mation, by which the sequence would first be split into 
smaller independent sequences. Neither does it rely on 
limiting the problem to a maximal bubble length. There- 
fore, Algorithm 2 is efficient. In computational RNA 
and protein studies, a maximal loop size is sometimes 
imposed as a heuristic for reducing time complexity by 
one order. Similarly, a maximal DNA bubble size of 50 
bp has been reported in computations of low tempera- 
ture bubble probabilities in the Peyrard-Bishop-Dauxois 
model [13J. In contrast, Algorithm 2 can find bubbles 
of whatever size at any temperature. In the 48502 bp 
[GcnBank:NC_001416], for example, bubbles and helical 
regions may be up to around 20000 bp long [IT]. Al- 
though Algorithm 2 has no explicit notion of a maximal 
bubble length, it may implicitly detect length limitations 
for both bubbles and helical regions by the absence of 
the "next bigger goat". In this way, Algorithm 2 can 
adapt to the input sequence. This adaptation is evident 
in Fig. 10 where the bubble execution time grows as 
bubbles get bigger at higher temperatures. Conversely, 
the helix execution time decreases as the helical regions 
gradually melt away. 

However, the time complexity was not proven to be 
0(N log N) under all conditions. It is still an open ques- 
tion whether there is a transition to time complexity 
0{N 2 ) in some peripheral regions of the input parameter 
space. But based on results so far, a fast computation 
would be expected in most situations. 

How fast is Algorithm 2? Figures [7] and [8] show that 
the Perl implementation runs on an old desktop PC at 
the speed of roughly 1000 basepairs per second. With to- 
day's computers, assuming twice that speed and enough 
memory, the E. coli genome would take 39 minutes, the 
yeast genome would take 1.7 hours, and the largest hu- 
man chromosome would take 35 hours. In some types 
of low temperature melting studies, the features of inter- 
est are the bubbles rather than the helical regions. In 
such applications, switching off the computation of helix 



stitches can speed up the algorithm several times. As 



Fig. 10 indicates, the helix time is about twice the bub- 
ble time at helicity equal to 0.95, that is, the speedup 
would be about threefold. The largest human chromo- 
some would be done in ten hours. On a computer clus- 
ter, the human genome could be computed in a day. Such 
bubbles could then be compared to TFBS, TSS, replica- 
tion origins, viral integration sites, etc. 

However, the required memory grows with sequence 
length and for sequences longer than 2 Mbp, more than 
1 GB was needed. The memory usage has not been tested 
further and the space complexity has not been discussed 
in this article. Some memory optimization of the Perl im- 
plementation must be done before such test can reflect 
the space complexity. While the algorithm is efficient in 
terms of time complexity, the code has room for opti- 
mization of both speed and memory usage. However, the 
space complexity is believed to be O(N), which means 
that the algorithm would eventually become out of mem- 
ory for long enough sequences. A standard solution is to 
introduce efficient use of disk space instead, which could 
reduce the memory usage to O(l), without increasing the 
time complexity. 



IV. CONCLUSIONS 

The fast algorithm described in this article enables 
the computation of stitch profiles of genomic sequences. 
Melting features of interest, such as bubbles, helical 
regions, and their boundaries, are computed directly, 
rather than relying on visualization or educated guesses. 
The algorithm is exact. It does not achieve its speed by 
approximations, such as windowing or maximal bubble 
sizes. Genomewide comparisons of bubbles with TSS, 
replication origins, viral integration sites, etc., are pro- 
posed. The algorithm is available in Perl code from the 
author. Online computation of stitch profiles is available 
on our web server, which has recently been upgraded to 
run Algorithm 2 [2"51l4"2"] . 
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